Sentiment analysis is the automated interpretation and classification of emotions (usually positive, negative, or neutral) from textual data such as written reviews and social media posts.
In this notebook, we will analyze the Kaggle dataset "Amazon Fine Food Reviews" to determine whether a review is positive or negative.
#Checking the head of the dataframe.
import pandas as pd
df = pd.read_csv('Reviews.csv')
df.head()
| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |
The columns we will use most in this analysis are "Summary", "Text", and "Score".
"Text" — the complete text of the product review.
"Summary" — a short summary of the review.
"Score" — the rating provided by the customer on a scale of 1-5, with 5 being the most positive and 1 the most negative.
# Imports
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
%matplotlib inline
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.express as px
# Product Scores
fig = px.histogram(df, x="Score")
fig.update_traces(marker_color="turquoise",marker_line_color='rgb(8,48,107)',
marker_line_width=1.5)
fig.update_layout(title_text='Product Score')
fig.show()
We can see that most of the reviews are positive based on the large number of reviews with a score of 4 or more.
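The skew can also be checked numerically with `value_counts`. A minimal sketch, using a toy stand-in data frame so it runs on its own (on the real data, apply the same calls to the loaded `df`):

```python
import pandas as pd

# Toy stand-in for the real Score column
df_toy = pd.DataFrame({"Score": [5, 5, 4, 1, 5, 2, 4, 5]})

# Count reviews per star rating and measure the share of 4-5 star reviews
counts = df_toy["Score"].value_counts().sort_index()
share_positive = (df_toy["Score"] >= 4).mean()
print(counts)
print(f"Share of 4-5 star reviews: {share_positive:.0%}")  # → 75% on this toy data
```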
# nltk (Natural Language Toolkit)
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud
# Create stopword list:
# nltk.download('stopwords')
stopwords = set(stopwords.words('english'))
stopwords.update(["br", "href"])
textt = " ".join(review for review in df.Text)
# nltk.download('punkt')
word_tokens = word_tokenize(textt)
# WordCloud.generate expects a single string, so rejoin the filtered tokens
clean_word_data = " ".join(w for w in word_tokens if w.lower() not in stopwords)
wordcloud = WordCloud(stopwords=stopwords).generate(clean_word_data)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.savefig('wordcloud11.png')
plt.show()
Next, we will classify reviews as "positive" or "negative" so they can serve as training data for our sentiment classification model.
Positive reviews will be classified as +1, and negative reviews will be classified as -1.
We will classify all reviews with ‘Score’ > 3 as +1, indicating that they are positive.
All reviews with ‘Score’ < 3 will be classified as -1. Reviews with ‘Score’ = 3 will be dropped, because they are neutral. This model will only classify positive and negative reviews.
# assign reviews with score > 3 as positive sentiment
# score < 3 negative sentiment
# remove score = 3
df = df[df['Score'] != 3]
df['Sentiment'] = df['Score'].apply(lambda rating : +1 if rating > 3 else -1)
df.head()
| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text | Sentiment |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... | 1 |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... | -1 |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... | 1 |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... | -1 |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... | 1 |
# split df - positive and negative sentiment:
positive = df[df['Sentiment'] == 1]
negative = df[df['Sentiment'] == -1]
df['sentimentt'] = df['Sentiment'].replace({-1: 'negative', 1: 'positive'})
fig = px.histogram(df, x="sentimentt")
fig.update_traces(marker_color="indianred",marker_line_color='rgb(8,48,107)',
marker_line_width=1.5)
fig.update_layout(title_text='Product Sentiment')
fig.show()
This model will take reviews as input and predict whether each review is positive or negative.
Since this is a classification task, we will train a simple logistic regression model.
Step 1: Data Cleaning
# First, we need to remove all punctuation from the summary data.
def remove_punctuation(text):
final = "".join(u for u in text if u not in ("?", ".", ";", ":", "!",'"'))
return final
df['Text'] = df['Text'].apply(remove_punctuation)
df = df.dropna(subset=['Summary'])
df['Summary'] = df['Summary'].apply(remove_punctuation)
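A quick sanity check of the helper on one of the summaries from the table above (redefining the function here so the snippet runs on its own):

```python
def remove_punctuation(text):
    # Drop the listed punctuation characters; everything else passes through
    final = "".join(u for u in text if u not in ("?", ".", ";", ":", "!", '"'))
    return final

print(remove_punctuation('"Delight" says it all!'))  # → Delight says it all
```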
''' Split the df:
The new data frame should only have two columns:
1. 'Summary' - (the review text data)
2. 'Sentiment' - (the target variable) '''
dfNew = df[['Summary','Sentiment']]
dfNew.head()
| | Summary | Sentiment |
|---|---|---|
| 0 | Good Quality Dog Food | 1 |
| 1 | Not as Advertised | -1 |
| 2 | Delight says it all | 1 |
| 3 | Cough Medicine | -1 |
| 4 | Great taffy | 1 |
Now, the data frame will be split into train and test sets.
80% of the data will be used for training, and 20% will be used for testing.
# random split train and test data
import numpy as np
# rand draws uniformly from [0, 1), so <= 0.8 keeps roughly 80% for training
# (randn would draw from a normal distribution and not give an 80/20 split)
df['random_number'] = np.random.rand(len(df))
train = df[df['random_number'] <= 0.8]
test = df[df['random_number'] > 0.8]
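An exact 80/20 split can also be obtained with scikit-learn's `train_test_split`. A sketch on a small toy data frame (the column names mirror ours, but the data is made up):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({"Summary": [f"review {i}" for i in range(10)],
                    "Sentiment": [1, -1] * 5})

# test_size=0.2 reserves exactly 20% of the rows for testing
train, test = train_test_split(toy, test_size=0.2, random_state=42)
print(len(train), len(test))  # → 8 2
```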
Next, we will use a count vectorizer from the scikit-learn library to transform the text in our data frame into a bag-of-words model.
A bag-of-words model converts each document into a vector of word counts, discarding word order.
We need this conversion because the logistic regression algorithm operates on numeric features, not raw text.
# count vectorizer:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
train_matrix = vectorizer.fit_transform(train['Summary'])
test_matrix = vectorizer.transform(test['Summary'])
# Logistic Regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
# Split target and independent variables
X_train = train_matrix
X_test = test_matrix
y_train = train['Sentiment']
y_test = test['Sentiment']
#fit model
lr.fit(X_train,y_train)
c:\Users\Charlie\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
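The warning does not stop the fit, but it can be addressed by raising the solver's iteration limit, as the message suggests. A minimal sketch on tiny made-up data (on the real data, pass the same `max_iter` when constructing the model above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# max_iter=1000 gives lbfgs more iterations to converge than the default 100
lr = LogisticRegression(max_iter=1000)

# Tiny toy problem: the second feature determines the class
X = np.array([[0, 1], [1, 0], [1, 1], [0, 0]])
y = np.array([1, -1, 1, -1])
lr.fit(X, y)
print(lr.predict([[0, 1]]))  # → [1]
```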
LogisticRegression()
# Make predictions
predictions = lr.predict(X_test)
# Confusion matrix
# find accuracy, precision, recall:
from sklearn.metrics import confusion_matrix, classification_report
# note: sklearn's convention is confusion_matrix(y_true, y_pred);
# passing (predictions, y_test) transposes the matrix relative to that convention
confusion_matrix(predictions, y_test)
array([[11547, 2431],
[ 5757, 91759]], dtype=int64)
# Classification report
print(classification_report(predictions,y_test))
precision recall f1-score support
-1 0.67 0.83 0.74 13978
1 0.97 0.94 0.96 97516
accuracy 0.93 111494
macro avg 0.82 0.88 0.85 111494
weighted avg 0.94 0.93 0.93 111494
The overall accuracy of the model on the test data is around 93%, which is good considering we did little feature engineering or preprocessing.
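The accuracy above comes from the classification report; it can also be computed directly with `accuracy_score`. A sketch on toy labels:

```python
from sklearn.metrics import accuracy_score

# Toy labels: 4 of 5 predictions match, so accuracy is 0.8
y_true = [1, -1, 1, 1, -1]
y_pred = [1, -1, -1, 1, -1]
print(accuracy_score(y_true, y_pred))  # → 0.8
```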